Goto

Collaborating Authors

 Oldenburg




Evaluating the Quality of the Quantified Uncertainty for (Re)Calibration of Data-Driven Regression Models

arXiv.org Machine Learning

In safety-critical applications data-driven models must not only be accurate but also provide reliable uncertainty estimates. This property, commonly referred to as calibration, is essential for risk-aware decision-making. In regression a wide variety of calibration metrics and recalibration methods have emerged. However, these metrics differ significantly in their definitions, assumptions and scales, making it difficult to interpret and compare results across studies. Moreover, most recalibration methods have been evaluated using only a small subset of metrics, leaving it unclear whether improvements generalize across different notions of calibration. In this work, we systematically extract and categorize regression calibration metrics from the literature and benchmark these metrics independently of specific modelling methods or recalibration approaches. Through controlled experiments with real-world, synthetic and artificially miscalibrated data, we demonstrate that calibration metrics frequently produce conflicting results. Our analysis reveals substantial inconsistencies: many metrics disagree in their evaluation of the same recalibration result, and some even indicate contradictory conclusions. This inconsistency is particularly concerning as it potentially allows cherry-picking of metrics to create misleading impressions of success. We identify the Expected Normalized Calibration Error (ENCE) and the Coverage Width-based Criterion (CWC) as the most dependable metrics in our tests. Our findings highlight the critical role of metric selection in calibration research.


Effective Explanations for Belief-Desire-Intention Robots: When and What to Explain

arXiv.org Artificial Intelligence

When robots perform complex and context-dependent tasks in our daily lives, deviations from expectations can confuse users. Explanations of the robot's reasoning process can help users to understand the robot intentions. However, when to provide explanations and what they contain are important to avoid user annoyance. We have investigated user preferences for explanation demand and content for a robot that helps with daily cleaning tasks in a kitchen. Our results show that users want explanations in surprising situations and prefer concise explanations that clearly state the intention behind the confusing action and the contextual factors that were relevant to this decision. Based on these findings, we propose two algorithms to identify surprising actions and to construct effective explanations for Belief-Desire-Intention (BDI) robots. Our algorithms can be easily integrated in the BDI reasoning process and pave the way for better human-robot interaction with context- and user-specific explanations.


Uncertainty quantification with approximate variational learning for wearable photoplethysmography prediction tasks

arXiv.org Artificial Intelligence

Photoplethysmography (PPG) signals encode information about relative changes in blood volume that can be used to assess various aspects of cardiac health non-invasively, e.g.\ to detect atrial fibrillation (AF) or predict blood pressure (BP). Deep networks are well-equipped to handle the large quantities of data acquired from wearable measurement devices. However, they lack interpretability and are prone to overfitting, leaving considerable risk for poor performance on unseen data and misdiagnosis. Here, we describe the use of two scalable uncertainty quantification techniques: Monte Carlo Dropout and the recently proposed Improved Variational Online Newton. These techniques are used to assess the trustworthiness of models trained to perform AF classification and BP regression from raw PPG time series. We find that the choice of hyperparameters has a considerable effect on the predictive performance of the models and on the quality and composition of predicted uncertainties. E.g. the stochasticity of the model parameter sampling determines the proportion of the total uncertainty that is aleatoric, and has varying effects on predictive performance and calibration quality dependent on the chosen uncertainty quantification technique and the chosen expression of uncertainty. We find significant discrepancy in the quality of uncertainties over the predicted classes, emphasising the need for a thorough evaluation protocol that assesses local and adaptive calibration. This work suggests that the choice of hyperparameters must be carefully tuned to balance predictive performance and calibration quality, and that the optimal parameterisation may vary depending on the chosen expression of uncertainty.


A General Approach of Automated Environment Design for Learning the Optimal Power Flow

arXiv.org Artificial Intelligence

Reinforcement learning (RL) algorithms are increasingly used to solve the optimal power flow (OPF) problem. Yet, the question of how to design RL environments to maximize training performance remains unanswered, both for the OPF and the general case. We propose a general approach for automated RL environment design by utilizing multi-objective optimization. For that, we use the hyperparameter optimization (HPO) framework, which allows the reuse of existing HPO algorithms and methods. On five OPF benchmark problems, we demonstrate that our automated design approach consistently outperforms a manually created baseline environment design. Further, we use statistical analyses to determine which environment design decisions are especially important for performance, resulting in multiple novel insights on how RL-OPF environments should be designed. Finally, we discuss the risk of overfitting the environment to the utilized RL algorithm. To the best of our knowledge, this is the first general approach for automated RL environment design.


Cognitive Prompts Using Guilford's Structure of Intellect Model

arXiv.org Artificial Intelligence

Large language models (LLMs) demonstrate strong language generation capabilities but often struggle with structured reasoning, leading to inconsistent or suboptimal problem-solving. To mitigate this limitation, Guilford's Structure of Intellect (SOI) model - a foundational framework from intelligence theory - is leveraged as the basis for cognitive prompt engineering. The SOI model categorizes cognitive operations such as pattern recognition, memory retrieval, and evaluation, offering a systematic approach to enhancing LLM reasoning and decision-making. This position paper presents a novel cognitive prompting approach for enforcing SOI-inspired reasoning for improving clarity, coherence, and adaptability in model responses.


InFL-UX: A Toolkit for Web-Based Interactive Federated Learning

arXiv.org Artificial Intelligence

The lack of direct involvement of domain experts, due to technical barriers, further delays the acquisition of new training data [7]. To address these challenges, Fails and Olsen [6] introduced interactive machine learning (IML), enabling non-technical users to train ML models using their own data through manual classification or correcting model outputs. Unlike traditional ML, IML allows real-time updates in response to user input, facilitating focused and incremental adjustments [1, 5]. Building on these advancements, Tseng et al. [17] developed Co-ML, a tablet-based application for collaboratively building ML image classification models across multiple devices, focusing on teaching dataset design practices by creating a shared dataset. In this paper, we extend these concepts by proposing a browser-based tool that allows users to collaborate on IML tasks using federated learning.


Machine-learning for photoplethysmography analysis: Benchmarking feature, image, and signal-based approaches

arXiv.org Artificial Intelligence

Photoplethysmography (PPG) is a widely used non-invasive physiological sensing technique, suitable for various clinical applications. Such clinical applications are increasingly supported by machine learning methods, raising the question of the most appropriate input representation and model choice. Comprehensive comparisons, in particular across different input representations, are scarce. We address this gap in the research landscape by a comprehensive benchmarking study covering three kinds of input representations, interpretable features, image representations and raw waveforms, across prototypical regression and classification use cases: blood pressure and atrial fibrillation prediction. In both cases, the best results are achieved by deep neural networks operating on raw time series as input representations. Within this model class, best results are achieved by modern convolutional neural networks (CNNs). but depending on the task setup, shallow CNNs are often also very competitive. We envision that these results will be insightful for researchers to guide their choice on machine learning tasks for PPG data, even beyond the use cases presented in this work.


Generalizable deep learning for photoplethysmography-based blood pressure estimation -- A Benchmarking Study

arXiv.org Artificial Intelligence

Photoplethysmography (PPG)-based blood pressure (BP) estimation represents a promising alternative to cuff-based BP measurements. Recently, an increasing number of deep learning models have been proposed to infer BP from the raw PPG waveform. However, these models have been predominantly evaluated on in-distribution test sets, which immediately raises the question of the generalizability of these models to external datasets. To investigate this question, we trained five deep learning models on the recently released PulseDB dataset, provided in-distribution benchmarking results on this dataset, and then assessed out-of-distribution performance on several external datasets. The best model (XResNet1d101) achieved in-distribution MAEs of 9.4 and 6.0 mmHg for systolic and diastolic BP respectively on PulseDB (with subject-specific calibration), and 14.0 and 8.5 mmHg respectively without calibration. Equivalent MAEs on external test datasets without calibration ranged from 15.0 to 25.1 mmHg (SBP) and 7.0 to 10.4 mmHg (DBP). Our results indicate that the performance is strongly influenced by the differences in BP distributions between datasets. We investigated a simple way of improving performance through sample-based domain adaptation and put forward recommendations for training models with good generalization properties. With this work, we hope to educate more researchers for the importance and challenges of out-of-distribution generalization.